| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |

In addition, we wanted to include further factors and therefore added the following three datasets:
By merging the datasets, we now have four additional factors: Alcohol, Population, Tobacco and Internet.
To join the different datasets, we had to do some manual preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN values) and joining the datasets on the year and the country code.
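A minimal sketch of such a join in base R is shown here. It is not the actual preprocessing code: the table names `happiness` and `alcohol` and their column layout are assumptions. The values are copied from the example rows in this report, and the 2015 alcohol value is left out only to illustrate what a missing match looks like.

```r
# Toy stand-ins for the happiness data and one additional dataset.
# Table names and column layout are assumptions for illustration.
happiness <- data.frame(
  Country   = c("Finland", "Norway", "Denmark", "Denmark"),
  Code      = c("FIN", "NOR", "DNK", "DNK"),
  Year      = c(2018, 2018, 2018, 2015),
  Happiness = c(7.632, 7.594, 7.555, 7.527)
)
alcohol <- data.frame(
  Code    = c("FIN", "NOR", "DNK"),
  Year    = c(2018, 2018, 2018),
  Alcohol = c(10.78, 7.41, 10.26)
)

# Left join on (Code, Year): every happiness row is kept, and rows without
# a matching alcohol entry get NA in the Alcohol column.
joined <- merge(happiness, alcohol, by = c("Code", "Year"), all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` keeps country-years that lack a match, so the gaps stay visible as `NA` instead of silently disappearing.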
After joining, we noticed that the three additional datasets do not contain data for the whole time span 2015-2022 (fig. “missing values full data”). Therefore, we decided to create two datasets: one for analysing the change in happiness over time, and one for analysing the factors that influence happiness in a single year.
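One way to inspect this coverage is to count the missing values per column and year. The sketch below is an illustration under assumptions: `full_data` is a toy table, its 2018 values are taken from the example rows in this report, and the 2017 entries are set to `NA` because the smoking and alcohol datasets contain no values for that year.

```r
# Toy joined table (an assumption, not the real joined data).
full_data <- data.frame(
  Country = c("Finland", "Norway", "Finland", "Norway"),
  Year    = c(2017, 2017, 2018, 2018),
  Alcohol = c(NA, NA, 10.78, 7.41),
  Tobacco = c(NA, NA, 19.7, 13.0)
)

# Count NA cells per year, separately for each added factor.
value_cols  <- c("Alcohol", "Tobacco")
na_per_year <- aggregate(
  is.na(full_data[value_cols]),
  by  = list(Year = full_data$Year),
  FUN = sum
)
print(na_per_year)
```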
For the first dataset, the over-time analysis, we included only the 6 factors from the base happiness dataset and excluded all rows containing missing values. We also renamed the columns to get shorter labels.

| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |
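
The renaming and the removal of incomplete rows can be sketched as follows. This is an illustration, not the original preprocessing code: `raw` holds only a few columns of the base table, and the lookup vector `new_names` is a hypothetical helper.

```r
# A few columns of the base happiness table, with the raw column names
# and the values of its first three rows.
raw <- data.frame(
  Country                       = c("Switzerland", "Iceland", "Denmark"),
  Happiness.Score               = c(7.587, 7.561, 7.527),
  Economy..GDP.per.Capita.      = c(1.39651, 1.30232, 1.32548),
  Health..Life.Expectancy.      = c(0.94143, 0.94784, 0.87464),
  Trust..Government.Corruption. = c(0.41978, 0.14145, 0.48357)
)

# Hypothetical lookup: old column name -> shorter label.
new_names <- c(
  Happiness.Score               = "Happiness",
  Economy..GDP.per.Capita.      = "Economy",
  Health..Life.Expectancy.      = "Health",
  Trust..Government.Corruption. = "Trust"
)

# Rename only the columns listed in the lookup, then drop incomplete rows.
idx <- match(names(new_names), names(raw))
names(raw)[idx] <- unname(new_names)
over_time <- na.omit(raw)
print(names(over_time))
```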
For the second dataset, the influential factors analysis, we inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). For comparison, figure “missing values 2017” shows that the smoking and alcohol datasets did not contain any values for 2017. We then again excluded all rows containing missing values and renamed the columns to get shorter labels.

| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region | Country.Code | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Finland | 1 | 7.632 | 1.305 | 1.592 | 0.874 | 0.681 | 0.393 | 0.202 | 2018 | Western Europe | FI | FIN | 10.78 | 5522585 | 19.7 | 88.88996 |
| Norway | 2 | 7.594 | 1.456 | 1.582 | 0.861 | 0.686 | 0.340 | 0.286 | 2018 | Western Europe | NO | NOR | 7.41 | 5337960 | 13.0 | 96.49166 |
| Denmark | 3 | 7.555 | 1.351 | 1.590 | 0.868 | 0.683 | 0.408 | 0.284 | 2018 | Western Europe | DK | DNK | 10.26 | 5752131 | 18.6 | 97.31920 |
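
Choosing the year with the fewest missing values and keeping only complete rows could look like the sketch below. The table `full_data` is a toy assumption with the same caveats as before: the 2018 values come from the example rows above, and the 2017 entries are `NA` because the smoking and alcohol datasets have no values for that year.

```r
# Toy joined table (an assumption, not the real joined data).
full_data <- data.frame(
  Country = c("Finland", "Norway", "Finland", "Norway"),
  Year    = c(2017, 2017, 2018, 2018),
  Alcohol = c(NA, NA, 10.78, 7.41),
  Tobacco = c(NA, NA, 19.7, 13.0)
)

# Total number of missing cells per year.
na_by_year <- tapply(rowSums(is.na(full_data)), full_data$Year, sum)

# Year with the fewest missing cells (the first one in case of ties).
best_year <- as.numeric(names(which.min(na_by_year)))

# Keep that year only, then drop rows that still contain missing values.
data_best <- full_data[full_data$Year == best_year, ]
data_best <- data_best[complete.cases(data_best), ]
print(best_year)
print(data_best)
```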

Figure: missing values full data

Figure: missing values 2017

Figure: missing values 2018

One of the objectives of preliminary data analysis is to get a feel for the data by describing its key features and summarizing the results. We focus on the second dataset, the influential factors analysis dataset, as it contains the most explanatory variables.
First, we use the summary to check how the explanatory variables are distributed. As we can see, they are on different scales, especially “population” and “Internet usage”. Because we do not want the following analysis to be dominated by the variables with the largest distances, we standardize each variable \(x\) as \(\frac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)}\).
## Happiness Economy Family Health
## Min. :2.905 Min. :0.0760 Min. :0.372 Min. :0.0000
## 1st Qu.:4.486 1st Qu.:0.7040 1st Qu.:1.063 1st Qu.:0.4475
## Median :5.483 Median :1.0100 Median :1.314 Median :0.6750
## Mean :5.489 Mean :0.9335 Mean :1.247 Mean :0.6283
## 3rd Qu.:6.332 3rd Qu.:1.2240 3rd Qu.:1.481 3rd Qu.:0.8180
## Max. :7.632 Max. :1.5760 Max. :1.644 Max. :1.0080
## Freedom Trust Generosity Alcohol
## Min. :0.0250 Min. :0.0000 Min. :0.0000 Min. : 0.003
## 1st Qu.:0.3875 1st Qu.:0.0500 1st Qu.:0.1020 1st Qu.: 3.220
## Median :0.5040 Median :0.0880 Median :0.1670 Median : 7.150
## Mean :0.4758 Mean :0.1195 Mean :0.1840 Mean : 6.842
## 3rd Qu.:0.5835 3rd Qu.:0.1450 3rd Qu.:0.2545 3rd Qu.:10.385
## Max. :0.7240 Max. :0.4570 Max. :0.5980 Max. :15.090
## Population Tobacco Internet
## Min. :3.367e+05 Min. : 3.70 Min. : 4.10
## 1st Qu.:5.488e+06 1st Qu.:13.90 1st Qu.:37.60
## Median :1.444e+07 Median :22.20 Median :68.21
## Mean :6.007e+07 Mean :22.02 Mean :60.43
## 3rd Qu.:4.430e+07 3rd Qu.:27.90 3rd Qu.:82.81
## Max. :1.428e+09 Max. :45.50 Max. :99.60
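The standardization described above corresponds to what base R's `scale()` does by default (centering on the mean, dividing by the standard deviation). Whether `scale()` was the function actually used is an assumption. The sketch uses the Population and Internet values of the three 2018 example rows, and `toy` is a hypothetical name.

```r
# Two variables on very different scales (values from the 2018 example rows).
toy <- data.frame(
  Population = c(5522585, 5337960, 5752131),
  Internet   = c(88.88996, 96.49166, 97.31920)
)

# scale() computes (x - mean(x)) / sd(x) column by column and returns a matrix.
scaled <- as.data.frame(scale(toy))

# The same computation written out by hand for one column.
manual <- (toy$Population - mean(toy$Population)) / sd(toy$Population)

print(all.equal(scaled$Population, manual))
print(scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates distance-based comparisons because of its units.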
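library(ggplot2)  # ggplot(), geom_boxplot(), geom_jitter()
library(plotly)   # ggplotly()
# Boxplot of the happiness score per region, with one jittered point per country
box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region)) +
  geom_boxplot() +
  geom_jitter(aes(color = Country), size = 0.5) +
  ggtitle("Happiness Score for Regions and Countries") +
  coord_flip() +
  theme(legend.position = "none")
# Convert the static plot into an interactive plotly widget
ggplotly(box)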
colnames(data_2018)
## [1] "Country" "Happiness.Rank" "Happiness" "Economy"
## [5] "Family" "Health" "Freedom" "Trust"
## [9] "Generosity" "Year" "Region" "Country.Code"
## [13] "Code" "Alcohol" "Population" "Tobacco"
## [17] "Internet"
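library(GGally)  # ggpairs()
# correlation_categories: vector of column names to compare (its definition is not shown here)
correlation_data <- data_2018[, correlation_categories]
ggpairs(correlation_data, title = "correlation matrix for influential factors analysis")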
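The correlation coefficients that `ggpairs()` displays for continuous variables can also be computed as a plain matrix with base R's `cor()`. The sketch below assumes Pearson correlation and uses only the three 2018 example rows, so the numbers themselves are not meaningful. It shows only the mechanics, and `toy` is a hypothetical name.

```r
# Three numeric columns from the 2018 example rows (far too few rows for
# a meaningful correlation; for illustration only).
toy <- data.frame(
  Happiness = c(7.632, 7.594, 7.555),
  Economy   = c(1.305, 1.456, 1.351),
  Internet  = c(88.88996, 96.49166, 97.31920)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(toy, method = "pearson")
print(round(cor_matrix, 2))
```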
Geographic map (color each country based on its percentage change over time, 2015-2022)
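The per-country percentage change needed for such a map could be computed as sketched below. This is an assumption about how the change might be defined (relative difference between the first and the last year). The helper `pct_change` is hypothetical, and because this report shows no 2022 values, the example compares 2015 with 2018 using the rows shown earlier.

```r
# Long-format toy data; only Denmark appears in both years here.
over_time <- data.frame(
  Country   = c("Denmark", "Denmark", "Switzerland"),
  Year      = c(2015, 2018, 2015),
  Happiness = c(7.527, 7.555, 7.587)
)

# Hypothetical helper: percentage change in Happiness between two years,
# for every country that has a value in both years.
pct_change <- function(df, from, to) {
  start <- df[df$Year == from, c("Country", "Happiness")]
  end   <- df[df$Year == to,   c("Country", "Happiness")]
  both  <- merge(start, end, by = "Country", suffixes = c(".from", ".to"))
  both$Change.Pct <- 100 * (both$Happiness.to - both$Happiness.from) /
    both$Happiness.from
  both
}

change <- pct_change(over_time, from = 2015, to = 2018)
print(change)
```

The resulting `Change.Pct` column is the value that would be mapped to the fill color of each country. Countries without data in both years drop out of the inner join and would stay uncolored.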